Fast GPGPU Data Rearrangement Kernels using CUDA

Authors

  • Michael Bader
  • Hans-Joachim Bungartz
  • Dheevatsa Mudigere
  • Srihari Narasimhan
  • Babu Narayanan
Abstract

* Corresponding author: [email protected]. Graduate student at TUM; work carried out at GE Global Research toward a master's thesis at TUM.

Many high performance computing algorithms are bandwidth limited, hence the need for optimal data rearrangement kernels and for their easy integration into the rest of the application. In this work, we have built a CUDA library of fast kernels for a set of data rearrangement operations. In particular, we have built generic kernels for rearranging m-dimensional data into n dimensions, such as Permute, Reorder, and Interlace/Deinterlace. We have also built kernels for generic stencil computations on two-dimensional data using templates and functors, which allow application developers to rapidly build customized high performance kernels. All the kernels achieve or surpass the best-known performance in terms of bandwidth utilization.
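To make the Interlace operation named in the abstract concrete, here is a minimal sketch of what such a rearrangement could look like in CUDA. It is not the library's code: the names `interlace2` and `interlace_on_gpu`, the restriction to two `float` arrays, and the launch configuration are all assumptions chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel (illustrative name and signature, not the paper's API).
// Interlaces two planar arrays a[0..n) and b[0..n) into out[0..2n)
// in the order a0, b0, a1, b1, ...
__global__ void interlace2(const float* a, const float* b, float* out, int n)
{
    // One thread per output element, so neighbouring threads
    // write neighbouring addresses in out.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 * n) {
        int src = i / 2;                          // source index in a or b
        out[i] = (i % 2 == 0) ? a[src] : b[src];  // even slots from a, odd from b
    }
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and copies the interlaced result back. Assumes a.size() == b.size().
std::vector<float> interlace_on_gpu(const std::vector<float>& a,
                                    const std::vector<float>& b)
{
    const int n = static_cast<int>(a.size());
    std::vector<float> out(2 * n);
    if (n == 0) return out;

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, 2 * bytes);

    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                          // assumed block size
    const int blocks = (2 * n + threads - 1) / threads;
    interlace2<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // The blocking copy below also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, 2 * bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

Because the kernel does no arithmetic beyond index computation, its run time is governed almost entirely by memory traffic, which is why the abstract measures such kernels by bandwidth utilization. Deinterlace would be the inverse mapping, reading `out[i]` and writing to `a[i / 2]` or `b[i / 2]`.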

Related resources

Technical Report: GIT-CERCS-09-06 A Characterization and Analysis of GPGPU Kernels

General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications, pushed to the forefront by the introduction of C-based programming environments such as NVIDIA’s CUDA [1], OpenCL [2], and Intel’s Ct [3]. While significant effort has been focused on developing and evaluating applications an...

Data access optimized applications on the GPU using NVIDIA CUDA

This work is an attempt to address the problem of bandwidth-limited performance of data-intensive GPGPU applications. Performance limited by memory bandwidth is a common issue faced by general data-intensive HPC applications. In the case of the GPU, this problem is more pronounced owing to the unique architecture. This problem has been tackled by optimizing basic data rearrangement operations on the ...

Developing a High Performance Gpgpu Compiler Using Cetus

In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the na...

Adaptable and Efficient Variable Size Template Matching in CUDA

Increasingly flexible GPUs and the advent of GPGPU (General Purpose GPU) languages, such as Nvidia’s CUDA and the OpenCL standard, offer potential peak performance that far exceeds that of general purpose CPUs for a variety of problems. However, architectural and programming restrictions often prevent programmers from achieving peak performance. Even for problems that map well to c...

FATSEA – An Architectural Simulator for General Purpose Computing on GPUs

We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language and aimed at GPGPU computing. FATSEA takes Parallel Thread eXecution (PTX) code as input, a device-independent code format generated by the Nvidia CUDA compiler, to validate results and estimate performance on Nvidia platforms. This paper shows ...

Journal title:
  • CoRR

Volume abs/1011.3583

Publication date 2009